gitindex: replace go-git blob reading with pipelined git cat-file --batch#1021
keegancsmith merged 3 commits into sourcegraph:main
Conversation
Replace the serial go-git BlobObject calls in indexGitRepo with a single
pipelined "git cat-file --batch --buffer" subprocess. A writer goroutine
feeds all blob SHAs to stdin while the main goroutine reads responses
from stdout, forming a concurrent pipeline that eliminates per-object
packfile seek overhead and leverages git's internal delta base cache.
Submodule blobs fall back to the existing go-git createDocument path.
Benchmarked on kubernetes (29,188 files, 261 MB), Apple M1 Max, 5 runs:
go-git BlobObject (before):
Time: 2.94s Allocs: 685K Memory: 691 MB
cat-file pipelined (after):
Time: 0.60s Allocs: 58K Memory: 276 MB
Speedup: 4.9x time, 12x fewer allocs, 2.5x less memory
keegancsmith left a comment
Awesome stuff. See inline, but the reason I am requesting changes is I want to avoid reading into memory files that we decide to not index due to being too large. I believe you should be able to shape the code written here into something which allows that.
gitindex/index.go (outdated):

```go
// Build the ordered list of (name, key) pairs and collect blob SHAs for
// the main repo so we can read them all at once via git cat-file --batch.
type indexEntry struct {
	key       fileKey
	blobIndex int // index into blobResults, or -1 for submodule blobs
}
entries := make([]indexEntry, 0, totalFiles)
mainRepoIDs := make([]plumbing.Hash, 0, totalFiles)
```
I think it makes sense to rather do two loops. One over mainRepoIDs and instead have another list for submoduleIDs. The actual code is quite minimal for submoduleIDs and the duplication then between what we do once we have created a doc is minimal.
Done — split into two separate loops. mainRepoKeys iterated with the streaming reader, submoduleKeys iterated with go-git createDocument. The builder sorts documents internally via sortDocuments so ordering between the two loops does not matter.
gitindex/index.go (outdated):

```go
var blobResults []blobResult
if len(mainRepoIDs) > 0 {
	var err error
	blobResults, err = readBlobsPipelined(opts.RepoDir, mainRepoIDs)
```
I think there is a problem with this current API in that you read everything into memory. This just won't scale for very large repos. It's part of why zoekt works with this design of adding documents that at certain thresholds decide to create a shard (which then rely on the GC to free memory).
Additionally previously if a blob was over our max size, we just wouldn't read it in to memory.
I would suggest instead designing the API to be like the archive/tar API. So you would still pass in a huge slice of mainRepoIDs. But it would then return a stateful struct over the output from the process. For each file we get a little bit of metadata (like size, etc) and then the caller here can read that into a []byte or call .Next to get it to skip over the file.
So I think to keep the nice property around just passing in all IDs, we still get git to output all objects. But here in go land we decide between actually reading the object into a []byte or just io.Discarding it.
You don't have to go crazy on the design of the API. Just do something nice and minimal to suit this use case.
Agreed, reworked. readBlobsPipelined is replaced by a streaming catfileReader — archive/tar.Reader-style:
```go
cr, _ := newCatfileReader(opts.RepoDir, mainRepoIDs)
for _, key := range mainRepoKeys {
	size, missing, _ := cr.Next()
	if size > SizeMax {
		continue // unread bytes auto-discarded by next Next()
	}
	content := make([]byte, size)
	io.ReadFull(cr, content)
}
```

- `Next()` reads the header and returns the size — content is not read until the caller calls `Read`
- Calling `Next()` again without reading discards unread bytes via `bufio.Reader.Discard`
- Large blobs never touch Go memory — only the header is parsed; content is discarded in the bufio buffer
- Peak memory is now bounded by `ShardMax` (one shard's worth of content), not total repo size
Replace the bulk readBlobsPipelined (which read all blobs into a
[]blobResult slice) with a streaming catfileReader modeled after
archive/tar.Reader:
```go
cr, _ := newCatfileReader(repoDir, ids)
for {
	size, missing, err := cr.Next()
	if err == io.EOF {
		break
	}
	if size > maxSize {
		continue // auto-skipped, never read
	}
	content := make([]byte, size)
	io.ReadFull(cr, content)
}
```
Next() reads the cat-file header and returns the blob's size. The
caller decides whether to Read the content or skip it — calling Next()
again automatically discards unread bytes via bufio.Reader.Discard.
Large blobs over SizeMax are never allocated or read into Go memory.
Also split the single interleaved loop into two: one for main-repo
blobs streamed via cat-file, one for submodule blobs via go-git's
createDocument. The builder sorts documents internally so ordering
between the loops does not matter.
Peak memory is now bounded by ShardMax (one shard's worth of content)
rather than total repository size.
Updated benchmarks after the streaming rewrite (median of 5 runs, Apple M1 Max, kubernetes 29,188 files):
The streaming API is faster than the previous bulk `readBlobsPipelined`.
keegancsmith left a comment
Ok did a much deeper review now. It's looking great! Just have a few inline comments then we should be able to ship this.
```go
// Writer goroutine: feed all SHAs then close stdin to trigger flush.
writeErr := make(chan error, 1)
go func() {
	defer stdin.Close()
```
also defer close(writeErr) so if you call Close twice it won't block forever.
Done — added defer close(writeErr) in the goroutine. Also wrapped Close() itself in sync.Once (same pattern as DirectoryWatcher.Stop in search/watcher.go) so that cmd.Wait() is also called exactly once. This makes Close fully idempotent — safe to call multiple times or concurrently. Added TestCatfileReader_DoubleClose and TestCatfileReader_ConcurrentClose that verify this.
```go
}

// Close shuts down the cat-file process and waits for it to exit.
func (cr *catfileReader) Close() error {
```
I worry about what happens when we call Close before git is finished giving us all its output. eg if we get an error during builder.Add we will call close before reading everything.
Maybe instead this should just always call something like cr.cmd.Kill() (I can't remember offhand what it is). Then I think all the intermediate goroutines/etc will drain.
I do quite like that you nicely clean up here, but I want to make sure we have good behaviour for early termination.
Good call — switched to kill-first. Close() now calls cmd.Process.Kill() before draining. This matches Gitaly's pattern — their cache eviction code has an explicit comment: "We first cancel the context such that the process gets a SIGKILL signal. Calling close() first may lead to a deadlock given that it waits for the process to exit, which may not happen if it hangs writing data to stdout."
After Kill, the drain completes instantly (stdout pipe gets EOF), the writer goroutine gets EPIPE on its next write and exits, and cmd.Wait() returns. The expected "signal: killed" exit error is suppressed via isKilledErr().
Added TestCatfileReader_CloseBeforeExhausted_ManyBlobs (200 files, reads only 1, then Close) and TestCatfileReader_CloseWithoutReading to exercise this.
gitindex/catfile.go (outdated):

```go
if lastSpace == -1 {
	return 0, false, fmt.Errorf("unexpected header: %q", header)
}
size, err = strconv.ParseInt(string(header[lastSpace+1:]), 10, 64)
```
minor: let's parse this as an int instead of int64, since all call sites work with int; then you can avoid the cast to int when you do Discard.
Done — Next() now returns int, pending field is int, uses strconv.Atoi. Removed all casts at call sites (Discard, size comparison against SizeMax, make([]byte, size)).
gitindex/index.go (outdated):

```go
var doc index.Document

if missing {
	doc = skippedLargeDoc(key, branches)
```
have you come across missing in your tests? Saying we skipped it due to a large file doesn't seem correct. We should either fail or, if this has a common cause, add a new indexing skip reason for a missing document.
You're right — added SkipReasonMissing with explanation "object missing from repository". Missing blobs now get skippedDoc(key, branches, index.SkipReasonMissing) plus a warning log. This is unexpected for local repos (could indicate corruption, shallow clone, or a race with git gc), so the warning helps operators diagnose it.
Updated file_category.go to handle SkipReasonMissing the same way as TooLarge — guess category from filename, don't examine content.
The go-git createDocument path keeps using SkipReasonTooLarge for ErrObjectNotFound since that has a different semantic (large objects excluded during fetch).
gitindex/index.go (outdated):

```go
doc = skippedLargeDoc(key, branches)
} else {
	content := make([]byte, size)
	if _, err := io.ReadFull(cr, content); err != nil {
```
minor: it's unusual to repeatedly call io.ReadFull on the same reader. You might be able to expose a nicer API with the use of io.LimitedReader. But this is a very minor point; this does work!
Considered this — io.ReadFull into a pre-allocated buffer is actually more efficient here since the size is known: zero intermediate allocations vs io.ReadAll(io.LimitReader(...)) which grows a buffer internally. Added a comment explaining the choice. The catfileReader already limits reads internally via the pending field (same approach as archive/tar.Reader), so an extra LimitedReader wrapper would be redundant.
What I meant is that the catfileReader API would expose a LimitedReader rather than callers using its Read implementation directly. But yeah, this all works, so this was more just me thinking it through. For example, I think that is how the archive/tar reader API likely works.
```go
// Stream main-repo blobs via pipelined cat-file --batch --buffer.
// Large blobs are skipped without reading content into memory.
if len(mainRepoIDs) > 0 {
```
Feels a bit error prone the number of places we need to remember to do cr.Close inside of this if statement. Maybe this whole if statement body can be factored out into a function, then we can do defer cr.Close?
Done — extracted indexCatfileBlobs() with defer cr.Close(). This eliminated all four manual cr.Close() calls (three error paths + one happy path). Any new error path added in the future is automatically covered by the defer.
gitindex/index.go (outdated):

```go
for _, name := range names {
	for _, key := range fileKeys[name] {
		if key.SubRepoPath == "" {
```
I have a slight concern this breaks stuff. For a few weeks can we have an envvar to switch off this behaviour? (I guess if, when turned off, this if statement is always false, it will just work out?)
Then after a few weeks we remove the envvar. But this code path is pretty critical on the server side both at Sourcegraph and GitLab.
Done — added ZOEKT_DISABLE_CATFILE_BATCH. Follows the exact same pattern as ZOEKT_DISABLE_GOGIT_OPTIMIZATION already in this function: cmp.Or(os.Getenv(...), "false") + strconv.ParseBool. When set to true, the if key.SubRepoPath == "" condition is effectively always false, so all keys go through the go-git createDocument path. Benchmarked both paths on kubernetes — cat-file path is 4.4x faster, go-git fallback works correctly.
…issing

Address review feedback on PR sourcegraph#1021:

- Make Close() idempotent via sync.Once; kill the git process first (matching Gitaly's pattern) instead of draining all remaining stdout, so early termination is fast. Suppress the expected SIGKILL exit error. Add defer close(writeErr) in the writer goroutine to prevent deadlock on double-close.
- Change Next()'s return type and the pending field from int64 to int, using strconv.Atoi. Removes casts at all call sites; SizeMax is already int.
- Add SkipReasonMissing for blobs that git cat-file reports as missing, instead of reusing SkipReasonTooLarge. Missing is unexpected for local repos (corruption, shallow clone, gc race), so log a warning.
- Extract indexCatfileBlobs() with defer cr.Close(), eliminating four manual Close() calls on error paths.
- Add ZOEKT_DISABLE_CATFILE_BATCH env var kill switch following the existing ZOEKT_DISABLE_GOGIT_OPTIMIZATION pattern. When set, all blobs fall back to the go-git createDocument path.
- Deduplicate skippedLargeDoc/skippedMissingDoc into skippedDoc(reason).
- Add 19 hardening tests covering Close lifecycle (double close, concurrent close, early termination), Read edge cases (partial reads, 1-byte buffer, empty blobs, read-without-next), missing-object sequences, large-blob byte precision, and duplicate SHAs.

Benchmarked on kubernetes (29,188 files): no performance regression (geomean -0.89%, within noise).
Addressed all inline comments from the second review — replied to each one individually. A summary of the changes is in the hardening commit message above.
Summary

- Replaces serial go-git `BlobObject` calls in `indexGitRepo` with a single pipelined `git cat-file --batch --buffer` subprocess
- `--buffer` switches git from per-object `write_or_die` flushes to libc stdio buffering, reducing syscalls
- Submodule blobs fall back to the existing `createDocument` path

Benchmarks
On kubernetes (29,188 files, 261 MB content), Apple M1 Max, median of 5 runs:
Reproducing:
Context
Filed as #1016. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.
Prior art: PR #424 benchmarked `git cat-file --batch` at 2.68x faster than go-git. PR #940 states "eventually we want to move away from go-git towards a vanilla direct git usage."

Design
The pipelined reader (`readBlobsPipelined` in `gitindex/catfile.go`):

- Spawns `git cat-file --batch --buffer` with the repo as the working directory
- A writer goroutine feeds all SHAs to stdin via a `bufio.Writer`, then closes stdin to trigger git's final `fflush(stdout)`
- The main goroutine reads responses from stdout via a `bufio.Reader`, parsing the `<oid> <type> <size>\n<content>\n` protocol
- The subprocess is reaped with `cmd.Wait()`
- Missing objects are reported via `blobResult.Missing` (matches the existing `ErrObjectNotFound` → `skippedLargeDoc` behavior)
Integration in `indexGitRepo` (`gitindex/index.go`):

- Separates main-repo blob keys (`SubRepoPath == ""`) from submodule keys
- Calls `readBlobsPipelined` once
- Builds each `index.Document` from the pipelined results for main-repo blobs
- Falls back to go-git `createDocument` for submodule blobs

Per-blob allocations are minimized to 1 (the content `[]byte` itself) via stack-allocated hex buffers, `ReadBytes` for headers, and `LastIndexByte` for size parsing.

Test plan
- `go test ./gitindex/ -race` passes (all existing tests + 2 new)
- `go vet ./gitindex/` clean
- `golangci-lint run ./gitindex/` — no new issues
- New tests: `TestReadBlobsPipelined` (normal, empty, binary, large blobs), `TestReadBlobsPipelined_WithMissing`
- New benchmarks: `BenchmarkBlobRead_GoGit` (baseline), `BenchmarkBlobRead_CatfilePipelined` (production path)
- … (`torvalds/linux`)

🤖 Generated with Claude Code